Tutorial Brief

Pandas is powerful and easy-to-use library for data analysis. Is has two main object to represents data: Series and DataFrame.

Finding Help:

	NumPy Base N-dimensional array package		SciPy Fundamental library for scientific computing		Matplotlib Comprehensive 2D Plotting
	IPython Enhanced Interactive Console		SymPy Symbolic mathematics		Pandas Data structures & analysis

Import libraries



In [1]:

    
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Working with Series

Series is an array like object.

pd.Series(self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)



In [2]:

    
x = pd.Series([1,2,3,4,5])
x









    Out[2]:





0    1
1    2
2    3
3    4
4    5
dtype: int64

Notice that generated an index for your item

Basic Operation



In [3]:

    
x + 100









    Out[3]:





0    101
1    102
2    103
3    104
4    105
dtype: int64



In [4]:

    
(x ** 2) + 100









    Out[4]:





0    101
1    104
2    109
3    116
4    125
dtype: int64



In [5]:

    
x > 2









    Out[5]:





0    False
1    False
2     True
3     True
4     True
dtype: bool

`any()` and `all()`



In [6]:

    
larger_than_2 = x > 2
larger_than_2









    Out[6]:





0    False
1    False
2     True
3     True
4     True
dtype: bool



In [7]:

    
larger_than_2.any()









    Out[7]:





True



In [8]:

    
larger_than_2.all()









    Out[8]:





False

`apply()`



In [9]:

    
def f(x):
    if x % 2 == 0:
        return x * 2
    else:
        return x * 3

x.apply(f)









    Out[9]:





0     3
1     4
2     9
3     8
4    15
dtype: int64

Avoid looping over your data

This is a %%timeit results from apply() and a for loop.



In [10]:

    
%%timeit

ds = pd.Series(range(10000))

for counter in range(len(ds)):
    ds[counter] = f(ds[counter])









    



1 loops, best of 3: 241 ms per loop



In [11]:

    
%%timeit

ds = pd.Series(range(10000))

ds = ds.apply(f)









    



10 loops, best of 3: 40 ms per loop

`astype()`



In [12]:

    
x.astype(np.float64)









    Out[12]:





0    1
1    2
2    3
3    4
4    5
dtype: float64

`copy()`



In [13]:

    
y = x



In [14]:

    
y[0]









    Out[14]:





1



In [15]:

    
y[0] = 100



In [16]:

    
y









    Out[16]:





0    100
1      2
2      3
3      4
4      5
dtype: int64



In [17]:

    
x









    Out[17]:





0    100
1      2
2      3
3      4
4      5
dtype: int64

Avoid using copy (is you can) to save memory



In [18]:

    
y = x.copy()



In [19]:

    
x[0]=1



In [20]:

    
x









    Out[20]:





0    1
1    2
2    3
3    4
4    5
dtype: int64



In [21]:

    
y









    Out[21]:





0    100
1      2
2      3
3      4
4      5
dtype: int64



In [22]:

    
x.describe(percentile_width=50)









    Out[22]:





count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64

DataFrame

pd.DataFrame(self, data=None, index=None, columns=None, dtype=None, copy=False)



In [23]:

    
data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(data, columns=["x"])



In [24]:

    
df









    Out[24]:






  
    
      
      x
    
  
  
    
      0
       1
    
    
      1
       2
    
    
      2
       3
    
    
      3
       4
    
    
      4
       5
    
    
      5
       6
    
    
      6
       7
    
    
      7
       8
    
    
      8
       9
    
  

9 rows × 1 columns

Selecting Data



In [25]:

    
df["x"]









    Out[25]:





0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: x, dtype: int64



In [26]:

    
df["x"][0]









    Out[26]:





1

Adding extra columns



In [27]:

    
df["x_plus_2"] = df["x"] + 2
df









    Out[27]:






  
    
      
      x
      x_plus_2
    
  
  
    
      0
       1
        3
    
    
      1
       2
        4
    
    
      2
       3
        5
    
    
      3
       4
        6
    
    
      4
       5
        7
    
    
      5
       6
        8
    
    
      6
       7
        9
    
    
      7
       8
       10
    
    
      8
       9
       11
    
  

9 rows × 2 columns



In [28]:

    
df["x_square"] = df["x"] ** 2
df["x_factorial"] = df["x"].apply(np.math.factorial)
df









    Out[28]:






  
    
      
      x
      x_plus_2
      x_square
      x_factorial
    
  
  
    
      0
       1
        3
        1
            1
    
    
      1
       2
        4
        4
            2
    
    
      2
       3
        5
        9
            6
    
    
      3
       4
        6
       16
           24
    
    
      4
       5
        7
       25
          120
    
    
      5
       6
        8
       36
          720
    
    
      6
       7
        9
       49
         5040
    
    
      7
       8
       10
       64
        40320
    
    
      8
       9
       11
       81
       362880
    
  

9 rows × 4 columns



In [29]:

    
df["is_even"] = df["x"] % 2
df









    Out[29]:






  
    
      
      x
      x_plus_2
      x_square
      x_factorial
      is_even
    
  
  
    
      0
       1
        3
        1
            1
       1
    
    
      1
       2
        4
        4
            2
       0
    
    
      2
       3
        5
        9
            6
       1
    
    
      3
       4
        6
       16
           24
       0
    
    
      4
       5
        7
       25
          120
       1
    
    
      5
       6
        8
       36
          720
       0
    
    
      6
       7
        9
       49
         5040
       1
    
    
      7
       8
       10
       64
        40320
       0
    
    
      8
       9
       11
       81
       362880
       1
    
  

9 rows × 5 columns

`map()`



In [30]:

    
df["odd_even"] = df["is_even"].map({1:"odd", 0:"even"})
df









    Out[30]:






  
    
      
      x
      x_plus_2
      x_square
      x_factorial
      is_even
      odd_even
    
  
  
    
      0
       1
        3
        1
            1
       1
        odd
    
    
      1
       2
        4
        4
            2
       0
       even
    
    
      2
       3
        5
        9
            6
       1
        odd
    
    
      3
       4
        6
       16
           24
       0
       even
    
    
      4
       5
        7
       25
          120
       1
        odd
    
    
      5
       6
        8
       36
          720
       0
       even
    
    
      6
       7
        9
       49
         5040
       1
        odd
    
    
      7
       8
       10
       64
        40320
       0
       even
    
    
      8
       9
       11
       81
       362880
       1
        odd
    
  

9 rows × 6 columns

`drop()`



In [31]:

    
df = df.drop("is_even", 1)
df









    Out[31]:






  
    
      
      x
      x_plus_2
      x_square
      x_factorial
      odd_even
    
  
  
    
      0
       1
        3
        1
            1
        odd
    
    
      1
       2
        4
        4
            2
       even
    
    
      2
       3
        5
        9
            6
        odd
    
    
      3
       4
        6
       16
           24
       even
    
    
      4
       5
        7
       25
          120
        odd
    
    
      5
       6
        8
       36
          720
       even
    
    
      6
       7
        9
       49
         5040
        odd
    
    
      7
       8
       10
       64
        40320
       even
    
    
      8
       9
       11
       81
       362880
        odd
    
  

9 rows × 5 columns

Multi Column Select



In [32]:

    
df[["x", "odd_even"]]









    Out[32]:






  
    
      
      x
      odd_even
    
  
  
    
      0
       1
        odd
    
    
      1
       2
       even
    
    
      2
       3
        odd
    
    
      3
       4
       even
    
    
      4
       5
        odd
    
    
      5
       6
       even
    
    
      6
       7
        odd
    
    
      7
       8
       even
    
    
      8
       9
        odd
    
  

9 rows × 2 columns

Controlling display options



In [33]:

    
pd.options.display.max_columns= 60
pd.options.display.max_rows= 6
pd.options.display.notebook_repr_html = False
df









    Out[33]:





   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
4  5         7        25          120      odd
5  6         8        36          720     even
  ..       ...       ...          ...      ...

[9 rows x 5 columns]

Filtering



In [34]:

    
df[df["odd_even"] == "odd"]









    Out[34]:





   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
2  3         5         9            6      odd
4  5         7        25          120      odd
6  7         9        49         5040      odd
8  9        11        81       362880      odd

[5 rows x 5 columns]



In [35]:

    
df[df.odd_even == "even"]









    Out[35]:





   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[4 rows x 5 columns]

Chaining Filters

`|` OR



In [36]:

    
df[(df.odd_even == "even") | (df.x_square < 20)]









    Out[36]:





   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[6 rows x 5 columns]

`&` AND



In [37]:

    
df[(df.odd_even == "even") & (df.x_square < 20)]









    Out[37]:





   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even

[2 rows x 5 columns]

Furter Chaining



In [38]:

    
df[(df.odd_even == "even") & (df.x_square < 20)]["x_plus_2"][:1]









    Out[38]:





1    4
Name: x_plus_2, dtype: int64

`scatter_matrix()`



In [39]:

    
pd.scatter_matrix(df, diagonal="kde", figsize=(10,10));



In [40]:

    
df.describe()









    Out[40]:





              x  x_plus_2   x_square    x_factorial
count  9.000000  9.000000   9.000000       9.000000
mean   5.000000  7.000000  31.666667   45457.000000
std    2.738613  2.738613  28.080242  119758.341137
min    1.000000  3.000000   1.000000       1.000000
25%    3.000000  5.000000   9.000000       6.000000
50%    5.000000  7.000000  25.000000     120.000000
            ...       ...        ...            ...

[8 rows x 4 columns]

Reading Data from CSV/TSV Files



In [41]:

    
url = "http://www.google.com/finance/historical?q=TADAWUL:TASI&output=csv"
stocks_data = pd.read_csv(url)



In [42]:

    
stocks_data









    Out[42]:





       Date      Open      High       Low     Close     Volume
0  11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714
1  10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735
2   7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194
3   6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941
4   5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475
5   4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538
         ...       ...       ...       ...       ...        ...

[241 rows x 6 columns]



In [43]:

    
stocks_data["change_amount"] = stocks_data["Close"] - stocks_data["Open"]
stocks_data["change_percentage"] = stocks_data["change_amount"] / stocks_data["Close"]
stocks_data









    Out[43]:





       Date      Open      High       Low     Close     Volume  \
0  11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714   
1  10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735   
2   7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194   
3   6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941   
4   5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475   
5   4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538   
         ...       ...       ...       ...       ...        ...   

   change_amount  change_percentage  
0          17.43           0.001645  
1          26.64           0.002518  
2          74.14           0.007026  
3          27.82           0.002655  
4          44.71           0.004278  
5         102.93           0.009892  
             ...                ...  

[241 rows x 8 columns]

	x	x_plus_2	x_square	x_factorial
0	1	3	1	1
1	2	4	4	2
2	3	5	9	6
3	4	6	16	24
4	5	7	25	120
5	6	8	36	720
6	7	9	49	5040
7	8	10	64	40320
8	9	11	81	362880

	x	x_plus_2	x_square	x_factorial	is_even	odd_even
0	1	3	1	1	1	odd
1	2	4	4	2	0	even
2	3	5	9	6	1	odd
3	4	6	16	24	0	even
4	5	7	25	120	1	odd
5	6	8	36	720	0	even
6	7	9	49	5040	1	odd
7	8	10	64	40320	0	even
8	9	11	81	362880	1	odd

Tutorial Brief

NumPy

SciPy

Matplotlib

IPython

SymPy

Pandas